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Processed pseudogenes acquired somatically 
during cancer development 
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Cancer evolves by mutation, with somatic reactivation of retrotransposons being one such 
mutational process. Germline retrotransposition can cause processed pseudogenes, but 
whether this occurs somatically has not been evaluated. Here we screen sequencing data 
from 660 cancer samples for somatically acquired pseudogenes. We find 42 events in 17 
samples, especially non-small cell lung cancer (5/27) and colorectal cancer (2/11). Genomic 
features mirror those of germline LINE element retrotranspositions, with frequent target-site 
duplications (67%), consensus TTTTAA sites at insertion points, inverted rearrangements 
(21%), 5' truncation (74%) and polyA tails (88%). Transcriptional consequences include 
expression of pseudogenes from UTRs or introns of target genes. In addition, a somatic 
pseudogene that integrated into the promoter and first exon of the tumour suppressor gene, 
MGA, abrogated expression from that allele. Thus, formation of processed pseudogenes 
represents a new class of mutation occurring during cancer development, with potentially 
diverse functional consequences depending on genomic context. 
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Mutation underpins both the evolution of species and the 
development of cancer 1 . In the germline, repetitive 
DNA, such as long interspersed elements (LINE) and 
Alu repeats, form a sizable proportion of the human genome. 
LINE elements continue to propagate in the genome through 
their retrotransposition machinery. Essentially, a functional LINE 
element, when transcribed and translated by the host machinery, 
encodes two proteins that co-ordinate reverse transcription of the 
mRNA template and its integration back into the genome at a 
distant site from the original element. Insertion of transposable 
elements has considerably reshaped the human genome over 
evolutionary time. A role for retrotransposition as a mutational 
force in somatic cells has been increasingly recognized in the last 
few years, documented to occur during both normal 
neurogenesis 2-5 and cancer development 6-10 . 

Germline pseudogenes, representing cDNA copies of mRNA 
transcripts, are a by-product of LINE-mediated retrotransposi- 
tion 11 . This copying of DNA sequence through an mRNA 
intermediate creates several distinctive genomic features of 
pseudogenes, including the presence of polyA tails, absence of 
intronic sequence and target-site duplications. In the germline, 
processed pseudogenes influence evolution through gene 
duplication, novel exons, gene fusions 12 , sequestration of 
miRNAs 13 and antisense transcript production 14 . 

Formation of processed pseudogenes has not been system- 
atically studied in cancer. From a screen of 660 cancer samples, 
we find that somatically acquired pseudogenes are present in 2.6% 
of cancers, especially lung and colorectal cancers. We find a 
diverse array of transcriptional consequences, including expres- 
sion from UTR and intronic insertions, as well as inactivation of 
transcription from exonic insertions. 

Results 

We developed bioinformatic methods to detect somatically 
acquired processed pseudogenes in massively parallel sequencing 
data from both targeted exome and genome-wide studies in 
cancer. Paired- end sequencing reads were aligned to the genome 
and transcriptome with a view to identifying reads that split 
exactly across canonical splice sites, were mapped to exons 
separated by more than the library insert size, or mapped between 
a pseudogene and its insertion site (Supplementary Fig. 1). To 
define a putative pseudogene, we required that at least three exons 
from a single gene were represented in the tumour DNA, with at 
least two canonical splice junctions directly observed from either 
split reads or confirmatory capillary sequencing. To establish that 
pseudogenes were not germline, we analysed sequencing data 
from the matched normal DNA for the given patient and 
screened for identical events in other patients. We performed 
PCR on tumour and matched normal DNA for all predicted 
pseudogenes with mapped insertion sites, and excluded variants 
with a positive germline PCR band from this analysis 
(Supplementary Fig. 2). 

We identified 42 somatically acquired pseudogenes in 14 out of 
629 primary cases and 3 out of 31 cell lines sequenced (Table 1, 
Supplementary Data 1 and Supplementary Table 1). As a typical 
example, in a lung cancer we identified an insertion of all five 
exons of the gene FOPNL, including a portion of the 5' UTR, the 
full coding sequence and the full 3' UTR, into the eleventh intron 
of SND1 in the opposite orientation (Fig. la). All four canonical 
exon-exon junctions of FOPNL were crossed by sequencing reads 
in the tumour DNA, and a polyA tail of at least 50 bp was present 
at the 3' end of the pseudogene. No such evidence was seen in the 
matched germline DNA from this patient or any other, and PCR 
confirmed the 5' and 3' insertion points as somatic. The two 
insertion points were mapped to base-pair resolution and 



revealed a target- site duplication of 10 bp that included the motif 
TTTT at either end of the pseudogene. 

We observed variations on this theme (Figs lb and 2a, 
Supplementary Fig. 3). PolyA tails were usually, but not 
universally, present, seen in 88% (21/24) pseudogenes with 
insertion sites fully mapped. Target- site duplications were 
frequent (67% with insertion sites fully mapped; 16/24), 
although in five cases target- site deletions occurred. The majority 
(74%; 31/42) of somatic pseudogenes did not contain the full 
coding sequence of the gene, as the 5' end was frequently 
truncated during reverse transcription and insertion. Two 
somatic pseudogenes were novel isoforms of their template genes 
not recorded in the Ensembl database. We did not observe any 
base substitutions in the somatic pseudogenes, implying that the 
reverse transcriptase has reasonable fidelity and the template 
transcripts had not undergone RNA editing. 

On the basis of these characteristics, somatic pseudogenes, like 
their germline counterparts, are probably a product of the reverse 
transcription and transposition capability of endogenous LINE 
elements acting on mature mRNA transcripts. Most compelling is 
the presence of target-site duplications of 8-20 bp, with the point 
of insertion of the polyA tail almost universally occurring within 
a TTTTAA or very similar motif (Fig. 2b), which is the classic 
signature of LINE retrotransposition ' 16 . Where we observed 
target- site deletions at insertion points the sequence showed less 
consensus, although there was still predilection for AT-rich 
sequences (Fig. 2b). We also found a somatic pseudogene inserted 
at the breakpoint of a somatically acquired genomic 
rearrangement (Supplementary Fig. 4), a process that can occur 
in the germline 17 . 

Interestingly, 21% (9/42) of somatic pseudogenes showed a 
single inverted rearrangement within the cDNA occurring away 
from the canonical splice sites of the gene (Fig. 2a,c, 
Supplementary Fig. 5). Such internal inverted rearrangements 
are observed in ~ 8% of LI LINE elements in the genome 18 and 
are thought to occur by a 'twin priming' mechanism. Here, not 
only does the TTTT overhang prime reverse transcription from 
the polyA tail, but the opposite strand, nicked 10-20 bp 
downstream, can fold back onto the mRNA transcript 
internally and prime another cDNA copy. These two partial 
copies of the mRNA are then resolved by non-homologous end- 
joining, leading to an inverted rearrangement 19 . Our data are 
consistent with this mechanism. The internal rearrangements 
lead to a shuffling of the exon order and direction in the 
pseudogene, but without duplication of sequence (Fig. 2c). The 
insertion points of the 5' end of the inverted pseudogenes showed 
1-3 bp of microhomology, consistent with a second priming 
event. Microhomology of between 1 bp and 4 bp was also usually 
present at the internal rearrangement breakpoint, indicative of 
non-homologous end- joining repair. 

The 629 primary cases included data from 18 different tumour 
types (Supplementary Fig. 6, Supplementary Table 2). Somatic 
pseudogenes were most frequent in non- small cell lung cancer 
(19%; 5/27 patients) and colorectal cancer (18%; 2/11). The fact 
that such events occur in colorectal and lung cancers is consistent 
with the observation of high rates of somatic retrotransposition of 
LINE elements in these tumour types 7-9 . 

For four patients with somatic pseudogenes, we sequenced 
more than one tumour sample (Fig. 2d). In one patient, we 
analysed four clonally related samples from two bronchial lesions, 
both of which evolved from carcinoma in situ to invasive cancer. 
We found somatic pseudogenes common to all four lesions and 
others unique to a single lesion (Fig. 2d). In another case of lung 
cancer we found four somatic pseudogenes, all of which were 
present in both the carcinoma in situ and the invasive tumour. 
Similarly, for two patients with colorectal cancer, we sequenced 
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Table 1 | Somatic pseudogenes identified across 660 cancer samples. 



Sample 


Cancer type 


Pseudogene 


Exons 


Full-length 


Insertion site 


Target-site dup. 


Internal inversion 


PD7354c 


Lung 


ATP11B 


19-30 


No 


Unmapped 


NA 


No 


PD7354h 


Lung 


ANLN 


15-24 


No 


Intergenic 


12 bp 


Yes 


PD7354h 


Lung 


KTNl 


11-14; 44 


No 


Intron 15 PSD3 


7 bp 


Yes 


PD7354h 


Lung 


UBAP2L 


21-26 


No 


Intergenic 


18 bp 


No 


PD7354h 


Lung 


MYBL2 


11-14 


No 


Intergenic 


NA 


No 


PD7354h 


Lung 


AKR1C3 


1-9 


No 


Unmapped 


NA 


No 


PD7354k 


Lung 


APP 


14-16 


No 


Unmapped 


NA 


No 


PD7354k 


Lung 


CD55 


1-11 


Yes 


Unmapped 


NA 


No 


PD7354k 


Lung 


CENPF 


13-20 


No 


Intergenic 


None 


No 


PD7354k 


Lung 


DDX46 


6-12 


No 


Unmapped 


NA 


No 


PD7354k 


Lung 


PSAP 


5-14 


No 


Unmapped 


NA 


No 


PD7354k 


Lung 


SlOOAll 


1-3 


Yes 


Upstream CLMP 


None 


No 


PD7354r 


Lung 


FM03 


5; 8-9 


No 


Upstream TIPIN 


10 bp 


No 


PD7354r 


Lung 


HDAC1 


6-14 


No 


Intergenic 


9 bp 


No 


PD7354r 


Lung 


HDDC2 


1-6 


Yes 


Rearrangement 


None 


No 


PD7354r 


Lung 


0DC1 


8-12 


No 


Intergenic 


5 bp 


No 


PD7355a 


Lung 


GOTl 


5-9 


No 


Intron 1 CDH12 


None 


No 


PD7356c 


Lung 


AKR1C1 


6-9 


No 


Intergenic 


14 bp 


No 


PD7356c 


Lung 


MYH9 


28-41 


No 


Intergenic 


10 bp 


Yes 


PD7356c 


Lung 


RCORl 


5-8; 12 


No 


Intron 3 ESR1 


16bp 


Yes 


PD7356c 


Lung 


TSPAN6 


1-4; 6-8 


Yes 


Intron 3 RIT2 


14 bp 


No 


PD7356i 


Lung 


FOPNL 


1-5 


Yes 


Intron 11 SND1 


10 bp 


No 


PD4864b 


Lung 


FNTA 


1-9 


No 


Intergenic 


NA 


No 


PD4864b 


Lung 


KRT14 


1-8 


No 


Unmapped 


NA 


No 


PD4861b 


Lung 


ARHGEF9 


4-10 


No 


Intergenic 


None 


No 


PD4861b 


Lung 


HNRNPD 


4-9 


No 


Intron 1 RAB8B 


9 bp 


No 


PD6377a 


Gastric 


SLC12A1 


25-27 


No 


Intron 15 ADAMTS3 


NA 


Yes 


PD6384a 


Gastric 


P0F1B 


1-11; 17 


No 


Intergenic 


NA 


Yes 


PD6388a 


Gastric 


MRPL11 


1-5 


Yes 


Intron 1 ABCA13 


15 bp 


No 


PD7261a 


Colorectal 


HDAC1 


1-14 


Yes 


Intron 1 RASA2 


NA 


No 


PD9061a 


Colorectal 


CAST 


18-29 


No 


Intergenic 


NA 


No 


PD6022a 


Gastric 


ARF4 


1-6 


Yes 


Unmapped 


NA 


No 


PD6037a 


Cholangiocarcinoma 


CSDE1 


7-16 


No 


Unmapped 


NA 


No 


PD4226a 


Breast 


THUMPD2 


8-10 


No 


Unmapped 


NA 


Yes 


PD4226a 


Breast 


COBL 


10-13 


No 


Unmapped 


NA 


No 


PD6368a 


Chondrosarcoma 


SEP15 


1-5 


Yes 


3' UTR WARS2 


NA 


No 


LB771-HNC 


Cell line (H&N) 


KRT6A 


1-3; 4-9 


No 


3' UTR MLL 


17 bp 


Yes 


LB771-HNC 


Cell line (H&N) 


KIF18A 


7-14 


No 


3' UTR BIN3, KIAA1967 


None 


No 


NCI-H2009 


Cell line (lung) 


C9orf41 


3-8; 8 


No 


Intergenic 


17 bp 


Yes 


NCI-H2009 


Cell line (lung) 


PTPN12 


12-17 


No 


Exon 1 MGA 


None 


No 


NCI-H2009 


Cell line (lung) 


IBTK 


1-29 


Yes 


Intergenic 


None 


No 


NCI-H2087 


Cell line (lung) 


ARPC5 


1-4 


Yes 


Intergenic 


16bp 


No 



Insertion sites that could not be mapped may be due to insertion into repetitive sequences or failure of exon capture to include UTRs. H&N, head and neck carcinoma; NA, not available. 



the primary tumour and a liver metastasis. In both cases, the 
somatic pseudogene was present in both primary and metastasis. 
Taken together, these data indicate that somatic pseudogenes can 
form relatively early in cancer development, before the tumour 
becomes invasive or metastatic, but can also occur in more 
advanced stages of disease. 

Highly expressed transcripts were especially likely to be 
templates for somatic pseudogenes (Fig. 3a). Overall, 63% of 
genes acting as the template for somatic pseudogenes were among 
the top quartile of expressed genes for that tumour type 20 , a 
statistically significant enrichment (P< 0.0001, Wilcoxon test). A 
similar property has been noted for germline pseudogenes 21 . This 
may explain why we see several examples in which a gene was 
recurrently retrotransposed as a somatic pseudogene. For 
example, we found two somatic pseudogene copies of HDAC1, 
one in a colorectal cancer and one in a lung cancer 
(Supplementary Fig. 7a). Similarly, we see somatic pseudogenes 
involving two aldo-keto reductase genes, AKR1C1 and AKR1C3 
(Supplementary Fig. 7b). The aldo-keto reductase proteins 
activate polycyclic aromatic hydrocarbons in tobacco smoke, a 
key step in inducing the genotoxicity of these critical 



carcinogens 22,23 . Overexpression of these genes has been 
documented in lung cancer 24 , which may explain why they 
recurrently serve as templates for somatic pseudogenes in this 
disease. 

Like any mutational process, the majority of somatic 
pseudogenes are likely to be passenger mutations, but a few will 
have functional consequences that may be oncogenic. Somatic 
pseudogenes could exert functional consequences through many 
different mechanisms 12 , including fusion gene formation, 
increased expression of the pseudogene, disruption of a gene at 
the insertion site, sequestration of miRNAs from the template 
gene 13 and production of antisense transcripts 14 . Among the 31 
somatic pseudogenes with insertion sites identified, nine were 
inserted into introns and three into 3' UTRs. Of these, five were 
in the same orientation as the host gene, three in the opposite 
orientation and four had internal inverted rearrangements. 

To assess transcriptional consequences of somatic pseudogene 
insertion, we performed RNA- sequencing on two primary 
cancers and three cell lines, as well as analysed 20 non- small 
lung cancers sequenced by TCGA 25 (Supplementary Table 3). 
Across these samples, there were 16 somatic pseudogenes, of 
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FOPNL 



Split in read 

Insert between paired reads 
— ►Linked to chr7: 127Mb 
I Mapping position of read 



SND1: Intron 11 

► r 



chr7: 127,462,764 



Target site 
duplication 



. CATAATTTTC 
■ GTATTAAAAG 



FOPNL: Somatic pseudogene 



SND1: Intron 11 
1 ► 

Chr7: 127,462,755 



CATAATTTTC 1 
GTATTAAAAG ' 



chr16: 15,959,576 



chr16: 15,982,446 



PolyA tail 



\ 



Target site 
duplication 



ARHGEF9: Somatic pseudogene 



chr3: chrX: 
181,126,747 62,917,097 



chr3: 

181,126,757 




Figure 1 | Somatic pseudogenes. (a) A somatic FOPNL pseudogene in a non-small cell lung cancer. Sequencing reads from high-coverage whole-genome 
shotgun sequencing of the tumour reveal a series of split reads (red) crossing the four canonical exon-exon splice junctions in the gene. In addition, read 
pairs map to adjacent exons with an insert size larger than expected (light brown). At either end of the gene, read pairs linking to chr7 could be identified, 
revealing that the FOPNL pseudogene is inserted into intron 11 of the SND1 gene in the opposite orientation with an intact polyA tail and a target-site 
duplication of 10 bp. (b) A somatic ARHGEF9 pseudogene in a non-small cell lung cancer. The insertion was confirmed as somatic by PCR (Supplementary 
Fig. 2) and capillary sequencing across an exon-exon junction and insertion site. 



which 10 were inserted in intergenic regions, three in introns, two 
in 3' UTRs and one in the first exon of the MGA gene. We found 
no evidence of expression from somatic pseudogenes inserted in 
intergenic regions, whereas one of the three intronic insertions 
was expressed. In this case, a partial KTN1 pseudogene inserted 
into the last intron of PSD3 in a primary squamous cell lung 
cancer. We saw a clear peak of expression arising from the PSD3 
intron immediately adjacent to the insertion, with aberrantly 
mapping read pairs aligning to the KTN1 UTR on one end and 
the PSD3 intron on the other end (Supplementary Fig. 8). Of the 
two insertions into 3' UTRs, both were expressed. In one, a 
KRT6A pseudogene was inserted into the 3' UTR of MLL, the 
latter being a well-known fusion gene in leukaemias 
(Supplementary Fig. 9). The RNA-sequencing data show that 
the last 1.2 kb of the MLL 3' UTR is lost from the mature 
transcript and replaced by the somatic pseudogene, since we 



found paired- end reads spanning the 5' insertion point but not 
the 3' insertion point (Fig. 3b). The 3' UTR of MLL has been 
shown to regulate transcript levels 26 , a feedback loop that could 
be disrupted by such a change, although expression of an aberrant 
transcript does not in itself imply oncogenicity. Similarly, a 
KIF18A pseudogene inserted into the 3' UTRs of two overlapping 
genes on opposite strands, KIAA1967 and BIN3. Reads reporting 
both KIAA1967-KIF18A and BIN3-KIF18A fusion junctions were 
found in the RNA-sequencing data (Supplementary Fig. 10). 

We also observed that somatic pseudogene insertion could 
abrogate expression of a target gene at the insertion site. In lung 
adenocarcinoma cell line NCI-H2009, a PTPN12 pseudogene 
caused an 8 kb target- site deletion that removed the promoter and 
first exon of MGA (Fig. 3c). In corresponding RNA-sequencing 
data, we find only wild-type exon 1 splicing into downstream 
exons, with no reads linking PTPN12 to MGA. Thus residual 
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a 



PolyA tail 



Full-length CDS 

Splice variant | 

Inverted rearrangement 
within pseudogene 
Insertion into gene footprint 
Microhomology at 5' 
insertion site 
Target site duplication 



0.0 



0.2 0.4 0.6 

Fraction of somatic pseudogenes 



0.8 



1.0 



GTTTC 
CTGAG 
GAAGG 
AAGAG 
AAATA 
TTTAT 
TTCAT 
TTGGC 
AACCA 
AGAAC 
GTTTG 
TGCTG 
AGAAA 
TAAAT 
AAATA 
AAATG 



Target site duplication 
I 

[ GGTTATTTCTTT] ATAGC 

[ TTTTCTT] AAAAT 
[AACTCCCTGAGCATTTCT] AGGTA 

[ GCATTCTTTT] AATTA 

[ ATAACTTTT] AAAAT 

[ GTTTT ] AAGAT 

[ TTTAGACCATTTCT ] ATGGA 

[ CAGAATGTTT] ACTTT 

[ AGCCCCTGTTGTCTTT] AATTT 

[ ACTGAGAAAGTTTT ] AACAT 

[ CATAATTTTC ] AACTT 

[ AGAAATCTT] ATTCC 

[ GAAAGAGAGCCTTTC ] AAAGG 

[ TTTTTTTATTTTTTAT] TTTTT 

[ ATTTTGATATATTTCT ] AATTA 

[ TTTTGTTTTAATTTT] AATTC 



PolyA tail insertion site- 



Target site deletion j 

l 0 

GTAGA (GTATG . . . ATAGA) AGAGA 
CAGTT ( ) AAAAA 

TAGTT ( TTAAGATGATTT ) AAGAA 
ACACC ( TTCTACAAG) ATATT 

CCCTC(CAGCA. . . GATCA) GACCT 
CAAGA (GCTTC . . . AAAAA) AAAAA 



C Chr11 intergenic 
I - 

chr11: 101,291,485 



ANLN: Somatic pseudogene 



«. AAAGAAATAACC 
* TTTCTTTATTGG 



/ 



Target site 
duplication 



chr7:36,463,459 chr7:36,489,433 



chr7:36,489,346 



250 bp 



chr7:36,493,400 

Inverted rearrangement 



Chr11 intergenic 



|A. .A 
T. .T 



chr11:101,291,474 



AAAGAAATAACC > 
TTTCTTTATTGG"* 



\ 



Target site 
duplication 



Lung cancer (PD7354) 

ATP1 



KTN1, UBAP2L, 
CENPF, AKR1C3 




Carcinoma 
in situ 

MYBL2, ANLN 

Invasive cancer 

Invasive cancer 



APP, CD55, DDX46, Carcinoma 
PSAP,S100A11 in situ 



Lung cancer (PD7356) 
AKR1C1, MYH9, 
RCOR1, TSPAN ^ ^ 

Carcinoma Invasive 
in situ cancer 

Colorectal cancers 
HDAC1 - 



FM03, HDAC1, 
HDDC2, ODC1 



Primary Metastasis 



# PD7261 



CAST 



i PD9061 



Primary Metastasis 



Figure 2 | Properties of somatic pseudogenes. (a) Histogram showing the fraction of somatic pseudogenes with particular features, (b) Sequences of 
target-site duplications (between square brackets) and adjacent genomic regions, showing that the polyA tail of the somatic pseudogene inserts in the 
consensus TTTTAA sequence between the TTTTand AA. Target-site deletions were also occasionally seen (deleted sequence between the round 
brackets), (c) An example of an internal inversion in a somatic pseudogene, inserted into intergenic sequence. The insertion was confirmed as somatic by 
PCR. (d) Phylogenetic trees for four patients in whom multiple samples were sequenced, showing at which stage during the evolution of the cancer somatic 
pseudogenes were acquired. 



expression is derived from the intact MGA allele, and loss of the 
promoter and exon 1 has eliminated expression from the 
disrupted allele. MGA encodes a MAX- interacting protein 27 , 
with focal deletions and inactivating mutations in lymphoid 
malignancies 28 ' 29 . From a compendium of 7,651 exome 
sequences 30 , we previously found that MGA is a likely tumour 
suppressor gene on the basis of a statistically significant excess of 
nonsense mutations (q<10 -6 ) especially in lung adenocarci- 
noma 31 (Supplementary Fig. 9b). 



The diversity, complexity and iniquity of mutational processes 
operative during the development of cancer have been laid bare by 
whole-genome sequencing, and here we describe another novel 
mechanism of somatic mutation. There has been much recent interest 
in how retrotransposition of repeat elements in somatic cells reshapes 
the genome during normal brain development and during cancer 
development 4,8 . The formation of pseudogenes in somatic cells repre- 
sents a companion mutational process, with considerable flexibility in 
potential mechanisms to alter a cell's transcriptional activity. 
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MGA 

| i I I ) [ )))))))) 




Figure 3 | Tissue-specific patterns of somatic pseudogenes. (a) Expression of template genes for somatic pseudogenes (individual points) compared 
with all genes for the most frequently affected organ sites. The violin plot formulation for all genes shows the median (white point), interquartile 
range (thick black line) and 1.5 x the interquartile range (thin black line). The coloured shapes denote a kernel density plot of the distribution of gene 
expression levels for all genes. Due to the non-normal distribution, we used Wilcoxon rank-sum tests to test whether expression levels of template genes 
for somatic pseudogenes were different to that expected, (b) RNA-sequencing data showing expression of the MLL-KRT6A fusion gene, (c) Deletion of the 
promoter and first exon of MGA during somatic pseudogene insertion, leading to abrogation of expression from that allele. 



Methods 

Sequence data. Sequencing data comprised low coverage (2-5 x genome 
coverage) paired-end sequencing for genomic rearrangements 32-34 , high-coverage 
(30-40 x ) paired-end shotgun sequencing 35 ' 36 and targeted pull-down and 
sequencing of the coding exome 37 ' 38 generated at the Wellcome Trust Sanger 
Institute as described, both published and unpublished, on the Illumina HiSeq 
platform (Supplementary Table 1). In total, 660 cancer samples (629 primary 
samples and 31 cell lines) spanning 18 tumour types were analysed (Supplementary 
Table 2). The identity of cell lines was confirmed by STR testing and were obtained 
from ATCC. The Cambridgeshire Local Research Ethics Committee approved the 
studies and all patients gave informed consent. 



Pseudogene detection. Data were aligned to both the reference genome 
(GRCh37) and the reference Ensembl transcriptome using BWA 3? and the 
alignment coordinates from the transcriptome mapping were converted back to 
genome space. Owing to the different characteristics of exome versus genome data, 



the analyses of these alignments were optimized to deal with the different data 
types (Supplementary Fig. 1). It is important to note that, regardless of the analysis 
method, exome versus genome data are likely to have differential sensitivities for 
detecting pseudogenes. 

Targeted exome data includes less than 2% (50 Mb) of the human genome and 
provides high depth over exons but does not contain the intronic and intergenic 
regions where the majority of genomic structural rearrangements lie. Analysis of 
genomic alignments through our standard structural variant algorithm 35 ' 40 was 
therefore sufficient for pseudogene detection in these data, as it provided high 
confidence calls for exon-exon junctions with few calls representing other types of 
genomic rearrangement. Candidate pseudogenes were required to have at least two 
apparent deletion events in the same gene, that is, involvement of at least three 
exons, with breakpoints separated by a distance of at least 500 bp but less than 
50 kb, a size range that includes the majority of introns of the human genome 41 . 
To remove polymorphic germline pseudogenes, groups were excluded if they 
contained any read pairs from either a matched normal or an unmatched normal 
panel of nine unrelated individuals. Breakpoints were further excluded if the 
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intervening intron(s) aligned with a deletion annotated in the Database of Genomic 
Variants, as this indicates a potential germline pseudogene. Transcriptome 
alignments were used to identify reads that split across exon-exon boundaries by 
visual inspection in IGV (Integrative Genomics Viewer). 

Due to the large size of the genome, the presence of repetitive elements, which 
tend to be more prevalent in non-coding sequences, and the presence of multiple 
genomic structural rearrangements in many tumour types, genome data required a 
custom pipeline to be developed (Perl scripts available on request). After alignment 
to the transcriptome and conversion of the read mapping coordinates back to 
genome space, read pairs that are derived from pseudogenes are characterized by a 
large insert size, due to either reads mapping to adjacent exons, or a split in the 
CIGAR string where two halves of a read have aligned across a splice junction in 
the transcriptome. Read pairs were therefore filtered on these characteristics. 
Retained pairs were required to have an insert size of between 300 bp and 50 kb, 
two or fewer splits in the CIGAR string with the first and last match being at least 
10 bp, and a mapping quality of at least 10. Read pairs passing these criteria were 
annotated against gene positions at both the 5' and 3' end of both read 1 and read 2. 
Read pairs where all four coordinates mapped within the same protein coding gene 
in Ensembl were retained. The number of supporting read pairs per gene for the 
tumour, matched normal and an unmatched normal panel of 23 low-depth 
genomes were calculated and genes with no supporting reads in the normal/normal 
panel but more than three supporting reads in the tumour were validated by visual 
inspection in IGV. 



Insertion site mapping. Somatic pseudogenes were best distinguished from 
contamination of the genomic DNA library by either RNA- sequencing libraries or 
plasmids containing cDNA expression constructs by mapping the insertion sites of 
the pseudogene to base-pair resolution. Mappings to the reference genome were 
used to identify insertion sites using reads mapping between the pseudogene and 
elsewhere in the genome and/or reads that aligned to one side of the insertion point 
with soft clipping. Soft-clipped reads were realigned using BLAT, to obtain exact 
breakpoint coordinates. However, the feasibility of mapping insertion points 
depended on the data type and the position of the insertion. 

In whole-genome shotgun sequencing, the insertion sites were identified for 
73% (16/22) of pseudogenes whereas 86% of insertion sites could be mapped in 
low-depth genomes (Supplementary Data 1). The lower rate of mapping in spite of 
the higher coverage in whole-genome shotgun sequencing may reflect pseudogene 
insertions into repetitive sequences in PD7354, which would be unmappable given 
the short read lengths characteristic of Illumina sequencing. 

Alignments to either the genome or transcriptome rarely provided insertion 
sites for exome data. The majority of somatic pseudogenes consist of the 3' UTR 
and varying extents of 5' truncation. As UTRs are not included in the bait design 
for exome pull-down, insertion junction sequences are commonly not represented 
in the bam files of these data. The 5' end insertion sites were mapped in 29% (2/7) 
of somatic pseudogenes, with 40% of the remaining pseudogenes being full-length 
and therefore including a 5' UTR. The 3' insertion site could only be mapped in 
14% (1/7) of pseudogenes identified in exome data. To assess the relative sensitivity 
of genome versus exome data for the detection of pseudogenes and capacity to map 
the insertion site, we sequenced both the exome and whole genome of three cell 
lines containing somatic pseudogenes to typical depth (Supplementary Table 4). 
The number of read pairs supporting the presence of a pseudogene through 
exon-to-exon mappings and the number of discordant read pairs reporting the 
insertion site for matched exome and genome data were compared. Overall, exome 
data shows greater sensitivity for splice junctions, whereas genome data are more 
likely to include the insertion junctions. 



Statistical analysis of recurrent point mutations. To assess whether any of the 
insertions disrupted tumour suppressor genes at the insertion site, we analysed 
point mutation calls from 7,651 exomes available from previous publications, 
TCGA, ICGC and in-house for evidence of a higher than expected rate of inacti- 
vating mutation, using an adaptation of the method described in Greenman 
et al., 42 . This analysis has been described in detail elsewhere 31 . 



PCR validation and capillary sequencing. The somatic status of pseudogenes was 
confirmed by designing primer pairs between the pseudogene and insertion site 
(where known) and performing PCR on both tumour and matched normal 
genomic DNA. DNA for both tumour and matched normal samples was available 
for 12/16 somatic pseudogene insertion sites predicted from whole- genome 
(>30 x ) data, all 10 insertion site predictions from low- depth data, and all 3 
mapped exome insertions. A total of 92% (11/12) whole-genome predictions 
validated, with the remaining 8% (1/12) not producing product in either tumour or 
normal, consistent with PCR/primer failure. All (3/3) exome predictions had 
confirmed somatic insertion points. Unsurprisingly, low-depth genomes, with no 
or low coverage sequencing of matching normals, gave a high rate of germline 
pseudogenes. Overall, 40% (4/10) PCRs showed product in both tumour and 
normal; 50% (5/10) were somatic and the remaining 10% (1/10) failed to produce 
product in either tumour or normal. The germline pseudogenes have not been 
included in this manuscript. 



RNA-sequencing. RNA- sequencing was performed on the three cell lines in which 
somatically//?? vitro acquired pseudogenes were identified and two of the primary 
samples in which somatic pseudogenes were identified. 

Total RNA was extracted using Trizol and sequencing libraries prepared by 
standard Illumina RNA-sequencing of polyA- selected RNA. A 75-bp paired-end 
sequencing was used and between 24 and 3 1 Gbp of data were generated per 
sample. TopHat (v 1.3.3) was used to map reads to the genome. The quality of the 
RNA-Seq data was assessed using a number of metrics, including absence of 3' bias 
and low amounts of ribosomal RNA. 

We confirmed the positions, insertion sites and expression of somatic 
pseudogenes in RNA-Seq using two algorithms (Shlien A., unpublished). First, we 
looked for discordantly mapped read pairs where one end mapped to the 
pseudogene and the other near the integration site. We then evaluated grouping of 
pairs by the consistency of the orientation in which the reads mapped, their 
position and their overlap with regions of high homology (multi-mapping). 
Second, we looked for reads that could be partially mapped to the pseudogene and 
partially mapped to the insertion site (split reads). To do so, we mapped all the 
RNA-seq data to a transcriptome database containing all known exon-exon 
junctions. We then shattered and remapped all of the remaining reads into k-mers 
(13 bp) to an indexed version of the human genome. Mapped fragments were 
extended one base at a time so long as they maintained a single mapping position. 
In this way we resolved the breakpoints of the expressed somatic pseudogenes. 
Note that in the absence of fusion transcript formation, re-expression of the 
pseudogene from its insertion site was indistinguishable from expression from the 
template copy. 
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